-
Notifications
You must be signed in to change notification settings - Fork 10k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve Kestrel connection metrics with error reasons #55565
Conversation
@@ -1727,7 +1727,7 @@ public async Task AbortConnectionAfterTooManyEnhanceYourCalms() | |||
await WaitForConnectionErrorAsync<Http2ConnectionErrorException>( | |||
ignoreNonGoAwayFrames: true, | |||
expectedLastStreamId: int.MaxValue, | |||
expectedErrorCode: Http2ErrorCode.INTERNAL_ERROR, | |||
expectedErrorCode: Http2ErrorCode.ENHANCE_YOUR_CALM, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amcasey I added a HttpConnection.Abort
overload that has an Http2ErrorCode
argument. When there are excessive resets I made the passed in value ENHANCE_YOUR_CALM
instead of INTERNAL_ERROR
. That seems a more appropriate error.
Did you choose INTERNAL_ERROR
for a reason here, or is it returned because that is what abort already sent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a hotfix, so I suspect I was making the smallest change possible, but I no longer recall.
7c05f97
to
8a35061
Compare
What was the motivation to add the data initially? I can hypothesize a reason someone might find it useful (diagnosing why connections didn't live as long as they were expected to live) but I don't know how often someone needs to diagnose that issue in practice and what alternatives would they have available if this metric attribute didn't exist?
Any attempt at classification gets very subjective but a few suggestions at least:
|
src/Servers/Kestrel/Core/src/Internal/Http/Http1OutputProducer.cs
Outdated
Show resolved
Hide resolved
I think it's valuable to know the end reason even if it's not an error and even if we don't have a specific scenario in mind. The change looks great from semconv perspective! LMK if you need any help adding new attributes to otel. |
This seems like data that should be going into logs not metrics. Metric dimensions should be kept small as adding additional properties and values adds incremental storage. If we can reduce this data down to one property - was the connection closed cleanly or due to an error, and provide logs to give more details as to why. Metrics are supposed to be about aggregated statistics not detailed analysis as to what happened. Anyone who is going to be doing analysis, can the use the metric to determine if the trend is changing, and then go to a log query to understand why. Once exemplars are better supported in .NET/OTel they will provide the correlation between metrics and traces/logs. |
There is prior art for showing more error info in
Note that while the
That is similar to how there are many HTTP status codes, but most of them you'll never see. |
@@ -72,7 +72,7 @@ protected override Task OnConsumeAsync() | |||
_context.ReportApplicationError(connectionAbortedException); | |||
|
|||
// Have to abort the connection because we can't finish draining the request | |||
_context.StopProcessingNextRequest(); | |||
_context.StopProcessingNextRequest(ConnectionEndReason.AbortedByApplication); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like we might want to distinguish between "the application requested an abort" and "the application failed and we had to abort".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the assumption here is that the only reason TryReadInternal
would throw an InvalidOperationException
is because middleware exited after calling HttpContext.Request.BodyReader.ReadAsync
without a corresponding call to AdvanceTo
, and rather than try to recover we bail out since we assume it's an unlikely scenario anyway.
I should have created a custom message explaining this rather than reuse CoreStrings.ConnectionAbortedByApplication
for the exception passed to ReportApplicationError
.
Ideally, we'd use "application_abort" only for HttpContext.Abort()
on HTTP/1.1 connections, and then use "application_left_bodyreader_in_invalid_state" for this, and "application_canceled_response_flush" for the usage of AbortedByApplication
in TimingPipeFlusher
.
If we reused the same ConnectionEndReason
for HttpCotnext.Abort()
and this issue with leaving the body reader in an invalid state, I won't be too upset since this scenario is pretty niche. Cancelling a response body flush on an HTTP/1.1 connection probably deserves its own ConnectionEndReason
though.
src/Servers/Kestrel/Core/src/Internal/Http2/Http2FrameWriter.cs
Outdated
Show resolved
Hide resolved
@@ -530,8 +544,14 @@ private void UpdateStreamTimeouts(long timestamp) | |||
await outboundControlStreamTask; | |||
} | |||
|
|||
// Use graceful close reason if it has been set. | |||
if (reason == ConnectionEndReason.Unknown && _gracefulCloseReason != ConnectionEndReason.Unknown) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I don't think && _gracefulCloseReason != ConnectionEndReason.Unknown
actually accomplishes anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean? Is it that _gracefulCloseReason != ConnectionEndReason.Unknown
condition could be removed and an unknown value is set to unknown?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's been a while, but I probably meant that reason == ConnectionEndReason.Unknown
implies _gracefulCloseReason != ConnectionEndReason.Unknown
at this point in the code. I may well have been mistaken.
With the notable disclaimer that I haven't actually monitored the health of a real service... I think I'd want to have it for all connections. Sure, the majority will be graceful shutdowns (hopefully), but it's (I assume) easy enough to filter out a few uninteresting values and which values are uninteresting seems (as mentioned above) very subjective. Another reason to include things that don't seem like errors is for making attack signatures. Something that's "normal" for a single connection may be problematic when it happens on many connections simultaneously. Or it may not be problematic, but still indicate that a change that signals a problem will soon occur. Personally, I think I'd have just end_reason and drop error_type altogether (with the caveat that we might want a special way to represent unhandled exceptions). If we go the other way, we have all sorts of strange cases to consider like whether cutting off a slow-reading client is an error. Arguably, that's the system functioning as expected and desired, but it's still interesting. |
@samsp-msft I'm not sure I'm following this line of reasoning. Given that logs are structured and queryable, why would anything not be logs-only? When would you use metrics instead? |
eb1b413
to
6ed61b7
Compare
src/Servers/Kestrel/Core/src/Internal/Infrastructure/PipeWriterHelpers/TimingPipeFlusher.cs
Outdated
Show resolved
Hide resolved
9b036f4
to
4c8f48d
Compare
4b4839f
to
c510461
Compare
d528f01
to
7b517ea
Compare
src/Servers/Kestrel/Core/src/Internal/Infrastructure/KestrelMetrics.cs
Outdated
Show resolved
Hide resolved
@@ -2395,7 +2415,9 @@ public async Task ConnectionClosedAfter101Response() | |||
[Fact] | |||
public async Task ThrowingResultsIn500Response() | |||
{ | |||
var testContext = new TestServiceContext(LoggerFactory); | |||
var testMeterFactory = new TestMeterFactory(); | |||
using var connectionDuration = new MetricCollector<double>(testMeterFactory, "Microsoft.AspNetCore.Server.Kestrel", "kestrel.connection.duration"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider extracting a helper for new MetricCollector<double>(testMeterFactory, "Microsoft.AspNetCore.Server.Kestrel", "kestrel.connection.duration")
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is big enough that I could keep giving feedback indefinitely, but I have no fundamental concerns and would be okay with merging it.
FlowControlWindowExceeded, | ||
KeepAliveTimeout, | ||
InsufficientTlsVersion, | ||
InvalidHandshake, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems similar to UnexpectedFrame
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean?
InvalidHandshake is only used for HTTP/2 and occurs when the client doesn't start the connection with the HTTP/2 preface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that comment was supposed to be about FrameAfterStreamClose
. My bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I wonder if I was commenting on an older commit. I can't find a refresh button in the VS Code UI that seems to cover moving forward to the latest revision of the PR.
src/Servers/Kestrel/Core/src/Internal/Infrastructure/KestrelTrace.General.cs
Outdated
Show resolved
Hide resolved
@@ -468,20 +473,24 @@ private void UpdateStreamTimeouts(long timestamp) | |||
{ | |||
Log.RequestProcessingError(_context.ConnectionId, ex); | |||
error = ex; | |||
reason = ConnectionEndReason.IOError; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd be fine to say it's some sort of transport level error.
} | ||
|
||
public void OnInputOrOutputCompleted() | ||
{ | ||
TryStopAcceptingStreams(); | ||
|
||
// Abort the connection using the error code the client used. For a graceful close, this should be H3_NO_ERROR. | ||
Abort(new ConnectionAbortedException(CoreStrings.ConnectionAbortedByClient), (Http3ErrorCode)_errorCodeFeature.Error); | ||
Abort(new ConnectionAbortedException(CoreStrings.ConnectionAbortedByClient), (Http3ErrorCode)_errorCodeFeature.Error, ConnectionEndReason.TransportCompleted); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like I mentioned in my other comment, I think indicating that this was client initiated would be good. If ClientFin
is too TCP specific, we could go with ClientClosedTransport
or something.
src/Servers/Kestrel/Core/src/Internal/Infrastructure/KestrelMetrics.cs
Outdated
Show resolved
Hide resolved
/benchmark plaintext aspnet-citrine-win kestrel |
Benchmark started for plaintext on aspnet-citrine-win with kestrel. Logs: link |
An error occurred, please check the logs |
/benchmark plaintext aspnet-citrine-win kestrel |
Benchmark started for plaintext on aspnet-citrine-win with kestrel. Logs: link |
plaintext - aspnet-citrine-win
|
/benchmark |
Crank Pull Request Bot
Benchmarks:
Profiles:
Components:
Arguments: any additional arguments to pass through to crank, e.g. |
/benchmark plaintext aspnet-citrine-win kestrel |
Benchmark started for plaintext on aspnet-citrine-win with kestrel. Logs: link |
plaintext - aspnet-citrine-win
|
Fixes #53358
This PR improves the kestrel.connection.duration metric by adding information about why the connection ended.
Tags added or changed:
http.connection.protocol_code
- Standards based, publicly communicated code for the connection ending. Often sent to the client. The value is either the HTTP/2 error code or HTTP/3 error code that the connection ended with. Doesn't apply to HTTP/1.1.kestrel.connection.end_reason
- A more detailed internal reason about why a connection ended. Always present.error.type
- If the end reason is an error, e.g. connection killed because of timeout, excessive stream resets, transport ended while there are active HTTP requests, invalid frame, etc.error.type
could be the same as the end reason. The value could also be an exception type name if an unhandled exception is thrown from connection middleware pipeline. The first error recorded wins. Only present if there is an error.We need to decide what the new tags is. Is it the end reason or error reason? For example, a connection can be ended by the client sending a go away frame. Do we want to track that? It's not an error to close a connection like that, so in this case we're are tracking all end reasons. On the other hand, we might want to only track unexpected and unhealthy reasons that end the connection. The benefit of only tracking errors is it is very easy to get just the unhealthy connections by filtering metrics to ones that have this tag. However, I think we can achieve this by also putting error reasons into the
error.type
tag.tldr
kestrel.connection.end_reason
is always present and provides the reason,error.type
is present if the end reason is considered an error and has the same value askestrel.connection.end_reason
.Questions for folks:
kestrel.connection.end_reason
and just haveerror.type
.Tasks:
cc @noahfalk @tarekgh @lmolkova @samsp-msft